The AI Hierarchy — What Fits Where
Every AI buzzword maps to a specific layer in a hierarchy. Understanding that hierarchy is the single most important preparation before you walk into any AI conference.
Key Distinction: AI vs ML vs Deep Learning
| Concept | What It Is | Example |
|---|---|---|
| AI | Any system that performs tasks typically requiring human intelligence. Includes rule-based systems. | A chess engine with hardcoded rules. An if-else fraud filter. |
| Machine Learning | A subset of AI. The system learns patterns from data instead of being explicitly programmed. | Spam filter that learns from labeled emails. Recommendation engines. |
| Deep Learning | A subset of ML using neural networks with many layers. Excels at unstructured data. | ChatGPT, image recognition, voice assistants. |
| Generative AI | A subset of DL that creates new content — text, images, code, audio. | Claude writing an email. DALL-E generating an image. |
CTO mental model: All generative AI is deep learning. All deep learning is ML. All ML is AI. But not all AI is ML — rule-based expert systems are AI but not ML.
Types of Machine Learning
Supervised Learning
Learns from labeled examples. Input → known output. Used for classification, regression, forecasting.
Unsupervised Learning
Finds patterns in unlabeled data. No "right answer" given. Used for segmentation, anomaly detection.
Reinforcement Learning
Agent learns by trial-and-error, maximizing a reward signal. Used for game AI, robotics, RLHF for LLMs.
Self-Supervised Learning
The model creates its own labels from the data. "Predict the next word." This is how LLMs are pre-trained.
How AI Actually Learns — From Data to Intelligence
Neural Networks: The Core Mechanism
A neural network is a function that takes input numbers and produces output numbers, with adjustable parameters (called weights) in between. "Learning" means adjusting those weights to minimize errors.
The Training Loop (every AI model follows this)
1. Forward pass: Feed input data through the network. It produces a prediction.
2. Loss calculation: Compare the prediction to the correct answer. Measure how wrong it was (the "loss").
3. Backpropagation: Calculate how each weight contributed to the error.
4. Weight update: Adjust weights slightly to reduce the error (using gradient descent).
5. Repeat: Do this billions of times across the entire dataset. Each full pass = one "epoch."
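To make the loop concrete, here is a minimal sketch, assuming a toy one-weight model fit to y = 2x; real networks have billions of weights and use automatic differentiation, but the loop has exactly this shape.

```python
# Minimal sketch of the training loop for a one-weight model y = w * x,
# fit to the target function y = 2x. Real networks have billions of
# weights and use autodiff, but the loop is the same.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)
w = 0.0                                       # the single adjustable weight
lr = 0.01                                     # learning rate

for epoch in range(100):                      # 5. repeat over the dataset
    for x, target in data:
        pred = w * x                          # 1. forward pass
        loss = (pred - target) ** 2           # 2. loss: squared error
        grad = 2 * (pred - target) * x        # 3. backprop: dLoss/dw
        w -= lr * grad                        # 4. weight update (gradient descent)

print(round(w, 3))                            # converges to ~2.0
```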
What Makes Deep Learning "Deep"
A shallow network has 1-2 hidden layers. A deep network has dozens to hundreds. Each layer learns increasingly abstract features: early layers pick up edges and textures, later layers object parts and whole objects.
The Transformer Architecture (2017 — the breakthrough)
Before transformers, AI processed text word-by-word sequentially (slow, forgetful). The transformer introduced self-attention: the model can look at all words in a sentence simultaneously and learn which words relate to which.
Why it matters: Every major LLM today — GPT-4, Claude, Gemini, Llama — is a transformer. The 2017 Google paper "Attention Is All You Need" is the single most important AI paper of the decade.
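For intuition, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a transformer. The tiny shapes and random projection matrices are illustrative; real models learn the Q/K/V projections and run many attention heads in parallel.

```python
# Minimal sketch of scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    weights = softmax(scores)                # rows sum to 1: "how much to attend"
    return weights @ V                       # weighted mix of all token values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (4, 8)
```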
Key Numbers (to have in your back pocket)
| Metric | What It Means | Typical Values |
|---|---|---|
| Parameters | The adjustable weights in the model. More ≈ more capacity to learn. | GPT-4: ~1.8T, Llama 3: 8B-405B, Claude: undisclosed |
| Context Window | How much text the model can "see" at once (input + output). | Claude: 200K tokens. GPT-4: 128K. Gemini: 1M+ |
| Tokens | Chunks of text (~0.75 words per token). The unit of measurement for LLMs. | This entire page ≈ 4,000 tokens |
| Training Data | Total text the model was trained on. | Typically trillions of tokens from books, web, code |
| Inference | Running the trained model to generate a response. What you pay for via API. | ~$3-15 per million input tokens (varies by model) |
Large Language Models — Plus SLMs & Frontier Models
What an LLM Actually Does
An LLM is a next-token predictor. Given a sequence of tokens, it predicts the probability distribution over all possible next tokens, then samples from that distribution. That's it. All the apparent "intelligence" emerges from doing this prediction extremely well over extremely large amounts of data.
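A minimal sketch of that predict-then-sample step, with temperature. The three-token vocabulary and logit values are invented for illustration; a real LLM scores a vocabulary of roughly 100K tokens at every step.

```python
# Minimal sketch of "predict a distribution, then sample" with temperature.
import numpy as np

vocab = ["Paris", "London", "pizza"]
logits = np.array([4.0, 2.5, 0.1])           # raw scores for each candidate token

def sample(logits, temperature=1.0):
    scaled = logits / temperature            # low temp sharpens, high temp flattens
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()              # softmax -> probability distribution
    return np.random.choice(vocab, p=probs)

print(sample(logits, temperature=0.2))       # almost always "Paris"
print(sample(logits, temperature=1.5))       # noticeably more variety
```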
The Three Phases of Building an LLM
Phase 1 — Pre-training (costs $10M-$100M+): Feed the model trillions of tokens of text from the internet, books, code. Produces a "base model" — it can complete text but won't follow instructions.
Phase 2 — Fine-tuning (SFT): Train on curated instruction/response pairs. Teaches the model to be a helpful assistant.
Phase 3 — RLHF: Human raters rank multiple model responses. A reward model learns what humans prefer. This is what makes Claude polite, safe, and genuinely useful.
SLM vs LLM vs Frontier Model — Match the Model to the Task
The terms SLM (Small Language Model), LLM (Large Language Model), and FM (Frontier Model) aren't three separate categories — LLM is the umbrella term. But they're labeled differently because we use them differently.
SLM — Small Language Model
Parameters: <10B. Role: Efficient specialist. Fast, cheap, runs on-prem. Best for: document classification, code routing, summarization. Well-tuned SLMs can match bigger models at focused tasks. Examples: IBM Granite 4.0, Mistral small models.
LLM — Large Language Model
Parameters: 10B-100B+. Role: Generalist. Broad knowledge across many domains. Best for: complex customer support, nuanced reasoning, multi-domain synthesis. Runs in cloud/SaaS.
FM — Frontier Model
Parameters: 100B+. Role: Cutting-edge. Best reasoning, best at complex multi-step tasks, deep tool integration. Best for: autonomous incident response, agentic systems, complex planning. Examples: Claude Opus, GPT-5, Gemini Pro.
Decision heuristic: Use an SLM when you need speed, low cost, or on-prem control. Use an LLM when you need broad knowledge and nuanced reasoning. Use a Frontier Model when you need the absolute best complex reasoning for multi-step problems. Match the model to the task — don't use a sledgehammer for a thumbtack.
Key LLM Capabilities
In-Context Learning
Give the LLM examples in the prompt, it adapts behavior without retraining. "Few-shot" prompting.
Chain of Thought
Ask it to "think step by step" and accuracy on reasoning tasks jumps dramatically.
Tool Use / Function Calling
The LLM outputs structured JSON to call external APIs, databases, or tools. Foundation of agents; see the sketch after this list.
RAG
Before answering, retrieve relevant documents from a database and inject them into the context. Reduces hallucination.
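To make tool use concrete, here is a hedged sketch of the function-calling round trip; the tool name, schema shape, and field names are illustrative rather than any specific provider's wire format.

```python
# Hedged sketch of the function-calling round trip.
tool_schema = {
    "name": "get_order_status",              # hypothetical tool
    "description": "Look up an order's shipping status by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

# Instead of prose, the model emits structured JSON naming a tool:
model_output = {"tool": "get_order_status", "args": {"order_id": "A-1042"}}

# Your code executes the real function and feeds the result back; the
# model then writes the final natural-language answer from it.
tool_result = {"order_id": "A-1042", "status": "shipped", "eta": "2 days"}
```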
What LLMs Cannot Do
• No true memory: Each conversation starts fresh unless you engineer persistence.
• Hallucinations: They confidently state false things. Inherent to probabilistic generation.
• No real-time data: Knowledge is frozen at training cutoff unless you add retrieval tools.
• Math and precise logic: Unreliable for complex calculations without tool use. They approximate; they don't compute.
• Determinism: Same input can produce different outputs. Temperature controls randomness but never eliminates it.
The Open Model Ecosystem — Hugging Face & When to Use It
What is Hugging Face?
Hugging Face is the largest public repository of pre-trained AI models. Think of it as the “GitHub for machine learning” — a shared platform where research labs, companies, and independent developers upload trained models, datasets, and application demos. The models range from tiny text classifiers to massive LLMs.
What lives on Hugging Face?
- Language models — Llama, Mistral, Gemma, BERT, and thousands of fine-tuned variants.
- Imaging models — Stable Diffusion, BLIP, Vision Transformers.
- Speech models — Whisper (ASR), Bark & XTTS (text-to-speech).
- Embedding models — the backbone of RAG (e.g., all-MiniLM-L6-v2).
- Multimodal models — models that jointly process text, images, and sometimes audio.
Base Models vs Fine‑Tuned Models — Where the Cost Really Goes
Training a model from random weights (a base model) demands enormous compute. For example, Google’s BERT needed 4 days on 64 TPUs at an estimated hardware cost of $50k–$100k, while today’s largest LLMs run into tens of millions of dollars just for the electricity and GPUs. These base models are built by well‑funded labs — Google, Meta, OpenAI, Mistral, Stability AI — who then release the finished weights publicly on Hugging Face.
When you “get a model from Hugging Face,” you almost never train from scratch. Instead, you download the open‑sourced weights and fine‑tune them on your own much smaller dataset. Fine‑tuning adjusts only the final layers (or a fraction of the total parameters) and can be done on a single GPU in hours for a few dollars. That is why a small team can build a custom medical‑document classifier or a support‑ticket router for a fraction of what the original training cost.
Base Model (Training from Scratch)
Who pays: Big Tech or well‑funded research groups
Compute: Hundreds of GPUs/TPUs for weeks or months
Cost: Millions of dollars
Output: A general‑purpose brain that understands language or images
Fine‑Tuned Model (Your Work)
Who pays: Your team
Compute: A single GPU for hours
Cost: Tens to hundreds of dollars
Output: A specialist that excels at one narrow task using the base model’s knowledge
Key insight: The expensive bit — learning grammar, common sense, visual features — has already been done. Hugging Face gives you a starting model that already knows what a sentence or an edge looks like. You spend a tiny amount to teach it your domain‑specific patterns.
When Should You Pull a Model from Hugging Face?
Enterprises and developers usually turn to Hugging Face in these concrete scenarios:
| Situation | Why Hugging Face (instead of a closed API) |
|---|---|
| Data must stay on‑prem / in your VPC | Download an open‑source LLM (Llama, Mistral, Gemma) and run it on your own servers. No data ever leaves your infrastructure. |
| Task is narrow and high‑volume | A fine‑tuned BERT‑family model for classification can be 100× cheaper per query than calling a GPT‑4 API and often just as accurate. |
| Cost at scale | For millions of inference requests per day, self‑hosting a small model on your own GPU instance usually beats pay‑per‑token pricing. |
| Avoiding vendor lock‑in | Open models are portable — you can move them between clouds or run them on‑prem, and swap providers freely. |
| R&D / prototyping | Experiment with different architectures without API bills. Test accuracy, speed, and failure modes before committing to a production stack. |
| Transparency & auditability | You can inspect the model card, training data, and even run bias and safety checks on open models — impossible with a closed API. |
| Embeddings / RAG pipeline | Hugging Face hosts state‑of‑the‑art embedding models that convert text into vectors for semantic search, often the best‑performing options available. |
Trade‑off: Self‑hosting an open model requires you to manage infrastructure, security, and model updates. If you lack in‑house ML engineering, a hosted API may still be the faster, safer choice.
How to Build a Model for Hugging Face (The Quick Version)
The typical workflow to create and share a model is straightforward:
1. Define the task — text classification, image recognition, question-answering, etc.
2. Collect a dataset — labeled examples specific to your problem.
3. Choose a pre-trained base model from Hugging Face that is already close to what you need (e.g., BERT for text, ViT for images).
4. Fine-tune using the Hugging Face transformers library — a Trainer handles the training loop.
5. Evaluate on a held-out test set to confirm performance.
6. Package the trained weights, config, and tokenizer into a single folder.
7. Write a model card (README) describing what the model does, its training data, limitations, and intended use.
8. Upload via the Hugging Face Hub — either drag-and-drop on the website or a single command with huggingface_hub.
The entire process can be done in a few hours, and many teams start from a community‑shared Colab notebook that already does steps 2–6 with a single click.
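As a concrete illustration of steps 3, 4, and 6, here is a minimal fine-tuning sketch using the transformers Trainer; the CSV file name, label count, and hyperparameters are placeholder assumptions, not a production recipe.

```python
# Minimal fine-tuning sketch with the transformers Trainer. Assumes
# "tickets.csv" has "text" and "label" columns (both hypothetical).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

base = "bert-base-uncased"                       # step 3: pre-trained base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=2)                          # e.g., billing vs. technical

data = load_dataset("csv", data_files="tickets.csv")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(                               # step 4: Trainer runs the loop
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
)
trainer.train()
trainer.save_model("my-ticket-router")           # step 6: weights + config...
tokenizer.save_pretrained("my-ticket-router")    # ...plus tokenizer, one folder
```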
The AI Provider Landscape
Foundation Model Providers
| Provider | Models | Strengths | Access |
|---|---|---|---|
| Anthropic | Claude (Opus, Sonnet, Haiku) | Safety, long context (200K), instruction following, coding, analysis | API, claude.ai, AWS Bedrock, GCP Vertex |
| OpenAI | GPT-4o, o1, o3 | Broad capabilities, vision, ecosystem, first-mover brand | API, ChatGPT, Azure OpenAI |
| Google | Gemini (Ultra, Pro, Flash) | Multimodal, huge context (1M+), integrated with Google Cloud | API, Gemini app, GCP Vertex |
| Meta | Llama 3/4 | Open-source, self-hostable, fine-tunable, no vendor lock-in | Download weights, run anywhere |
| Mistral | Mixtral, Mistral Large | European, efficient, open-weight options | API, self-host |
Open Source vs Closed Source — CTO Decision Framework
| Factor | Closed (Claude, GPT-4) | Open (Llama, Mistral) |
|---|---|---|
| Performance | Generally best-in-class | Closing the gap rapidly |
| Cost | Pay per token (API) | Infra cost (GPUs) — can be cheaper at scale |
| Data privacy | Data sent to provider's API (enterprise tiers offer zero retention) | Runs on your infra — full control |
| Customization | Prompt engineering, some fine-tuning | Full fine-tuning, modify architecture |
| Maintenance | Provider handles everything | You own ops, updates, security |
| Best for | Fast deployment, best quality, small-medium scale | High volume, strict compliance, niche domains |
AI Agents & Agent Skills
What is an AI Agent?
An AI agent is an LLM that can plan, use tools, observe results, and iterate — autonomously. Instead of just answering a question, it takes action to accomplish a goal. This is the shift from monolithic models to compound AI systems — where the model is integrated into existing processes with programmatic components around it.
Agent vs Automation vs Chatbot — The Critical Distinction
| Feature | Traditional Automation (RPA) | Chatbot (rule-based) | AI Chatbot (LLM) | AI Agent |
|---|---|---|---|---|
| Decision making | None. Follows fixed rules. | Decision tree only | Flexible, but single-turn | Plans multi-step, adapts |
| Handles ambiguity | No — breaks on edge cases | No | Yes | Yes |
| Uses tools | Hardcoded integrations | No | If programmed to | Autonomously decides which tools |
| Memory | None | Session only | Session only | Short + long-term memory |
| Autonomy | Zero | Zero | Low | High — can loop, retry, escalate |
The Agent Loop (ReAct Pattern)
1. Observe: Receive user request or trigger event. Retrieve relevant context from memory.
2. Think: The LLM reasons about what to do next. Creates a plan or picks the next action.
3. Act: Call a tool — query a database, call an API, send a message, update a record.
4. Observe: Check the result. Did it work? Was the data correct?
5. Loop or stop: If the goal is met, respond. If not, go back to step 2 with updated context.
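A minimal sketch of this loop in Python; llm() and the single fake tool are hypothetical stand-ins rather than any specific SDK, and the hard step cap is the kind of guardrail covered below.

```python
# Minimal agent-loop sketch (ReAct pattern) with hypothetical stand-ins.
import json

def llm(messages: list) -> dict:
    """Stand-in for a chat-completion call. Expected to return either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError

TOOLS = {
    "lookup_order": lambda order_id: {"status": "shipped"},  # fake demo tool
}

def run_agent(user_request: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_request}]    # 1. Observe
    for _ in range(max_steps):         # hard cap: guardrail against runaway loops
        decision = llm(messages)       # 2. Think: model picks the next action
        if "answer" in decision:       # 5. Stop: goal met, respond
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # 3. Act
        messages.append({"role": "tool",                      # 4. Observe
                         "content": json.dumps(result)})
    return "Step limit reached; escalating to a human."
```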
AI Agent Skills — Procedural Knowledge for Agents
LLMs know facts (semantic memory) but lack procedural knowledge — the step-by-step workflows specific to how work actually gets done. Agent Skills solve this by packaging procedural knowledge into a simple, portable format.
What a Skill Looks Like
A skill is simply a skill.md file in a folder. At minimum it has:
- Name — identifies the skill
- Description — tells the agent when to use this skill (the trigger condition)
- Body — step-by-step instructions, rules, examples in plain markdown
Optional folders: scripts/ (executable Python/JS/Bash), references/ (additional docs), assets/ (templates, data files).
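For illustration, a minimal sketch of what such a skill file might contain; the skill name, trigger, and steps are invented, with the name and description carried as metadata at the top of the file.

```markdown
---
name: refund-processing
description: Use when a customer asks for a refund on an order.
---
1. Look up the order with the CRM tool.
2. If the order is under 30 days old, issue the refund.
3. Otherwise, draft an escalation summary for a human reviewer.
```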
Progressive Disclosure — Three Tiers
When an agent has hundreds of skills, loading all of them into the context window would blow the token budget. Skills use progressive disclosure:
- Tier 1 (metadata): every skill's name and description stay in context at all times, so the agent knows what is available.
- Tier 2 (instructions): the full skill.md body is loaded only when the description matches the task at hand.
- Tier 3 (resources): scripts/, references/, and assets/ are loaded at the point of need.
How skills relate to MCP and RAG: MCP gives agents tool access (what the agent can reach). RAG gives factual knowledge (reference material). Skills give procedural knowledge — how to do things, in what order, with what judgment. Skills often use MCP tools, with the skill providing the judgment for when and how to invoke them. The skill.md format is an open standard (Apache 2.0) at agentskills.io, adopted across Claude Code, OpenAI Codex, and many other platforms.
Trust warning: Skills can include executable scripts with access to file systems and API keys. Always review a skill before installing it — audits have found prompt injection, tool poisoning, and hidden malware in publicly available skills. Treat skill installation like any software dependency.
Agent Architecture Components
LLM Core
The reasoning engine. Chooses actions, interprets results.
Tools
Functions the agent can call: APIs, DB queries, web search, calculators.
Memory
Short-term: conversation context. Long-term: vector DB of past interactions.
Guardrails
Rules constraining what the agent can do. Approval workflows for high-stakes actions.
Orchestrator
Manages the agent loop. Handles retries, timeouts, error handling.
Observability
Logging every step: what the agent thought, what tools it called, what it returned.
Building AI Applications — RAG, CAG & Multimodal RAG
RAG: The Most Common Enterprise AI Pattern
RAG (Retrieval-Augmented Generation) is how you make an LLM answer questions about your data without retraining the model. It's a compound AI system: the model queries an external searchable knowledge base, retrieves relevant documents, and uses them as context for generation.
How RAG Works — Step by Step
1. Ingest documents: Take your internal docs (PDFs, Confluence, Slack, CRM). Split into chunks (200-500 tokens each).
2. Create embeddings: Run each chunk through an embedding model. Converts text to a numerical vector (~1,500 numbers) capturing semantic meaning.
3. Store in vector database: Store vectors in a vector DB (Pinecone, pgvector, Qdrant, Chroma). Enables fast similarity search.
4. At query time: Convert the user's question to a vector using the same embedding model.
5. Search: Find the top 5-20 most similar document chunks in your vector DB.
6. Augment: Insert those chunks into the LLM prompt as context.
7. Generate: The LLM answers based on the retrieved context, dramatically reducing hallucination.
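A minimal end-to-end sketch of those seven steps. The embed() function is a toy stand-in (swap in a real embedding model), llm() is a stub for your chat-model API, and the in-memory list stands in for a vector DB.

```python
# Minimal RAG sketch following the steps above.
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = np.zeros(64)                       # toy bag-of-bytes "embedding";
    for b in text.encode():                  # swap for a real embedding model
        vec[b % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def llm(prompt: str) -> str:
    raise NotImplementedError                # wire to Claude/GPT-4 etc.

# Steps 1-3: chunk documents, embed each chunk, store vector + text.
chunks = ["Refunds are issued within 5 days.", "Support hours are 9-5 CET."]
index = [(embed(c), c) for c in chunks]

def answer(question: str, k: int = 5) -> str:
    q = embed(question)                                      # step 4: embed query
    ranked = sorted(index, key=lambda p: -float(p[0] @ q))   # step 5: similarity search
    context = "\n\n".join(text for _, text in ranked[:k])
    prompt = (f"Answer using only this context:\n{context}\n\n"  # step 6: augment
              f"Question: {question}")
    return llm(prompt)                                       # step 7: generate
```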
RAG at Scale: Millions of Documents
When people talk about RAG over millions of PDFs, they're describing a search system + an LLM — not just a vector database demo.
1. Ingestion (offline)
Take PDFs from cloud storage. OCR if scanned. Clean up text. Split into chunks of ~512-2000 words with overlap. Attach metadata — document ID, page number, section title, date.
2. Embeddings + Index
Run dedicated embedding jobs (GPUs or large batches). Build a distributed index (Milvus, Qdrant, Vespa, Elasticsearch). Use efficient algorithms (HNSW, IVF, PQ). Metadata filtering happens first — vector search helps rank, not replace, filtering.
3. Retrieval + Generation
Embed the incoming query, apply metadata filters, pull the top candidates from the distributed index, rerank them, and pass only the best chunks to the LLM. Same steps as basic RAG, just engineered for scale.
4. Caching
Cache: query → final answer; query → list of relevant chunk IDs; hot data (frequently used vectors). Real request path: User query → cache → if not found → retrieve + LLM → save to cache.
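A minimal cache-aside sketch of that request path, reusing the answer() function from the RAG sketch above; the plain dict stands in for Redis or another shared cache.

```python
# Cache-aside sketch: check the cache first, fall back to retrieve + LLM.
cache: dict[str, str] = {}

def cached_answer(query: str) -> str:
    if query in cache:          # hit: skip retrieval and the LLM entirely
        return cache[query]
    result = answer(query)      # miss: full retrieve + generate path
    cache[query] = result       # save for next time
    return result
```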
5. Monitoring
Track retrieval quality, answer quality (thumbs up/down), latency, cache hit rate. Re-embed and re-shard when models or data change.
CAG: Cache Augmented Generation — An Alternative to RAG
CAG takes a different approach: instead of retrieving knowledge on demand, you preload the entire knowledge base into the model's context window all at once. The model processes everything in a single forward pass and stores its internal state (the KV cache — key-value cache). Subsequent queries use this cached state without reprocessing all the text.
RAG — Retrieve on Demand
Knowledge base: Can be massive (millions of docs). Only retrieves small pieces at a time.
Latency: Higher — extra retrieval step per query.
Data freshness: Easy — update the index incrementally.
Best for: Large, dynamic knowledge bases; when citations are needed.
CAG — Preload Everything
Knowledge base: Constrained by context window size (32K-100K tokens typical).
Latency: Lower — no retrieval lookup; one forward pass.
Data freshness: Requires recomputation when data changes.
Best for: Small, static knowledge bases; when low latency matters.
RAG or CAG? Use RAG when your knowledge source is very large, frequently updated, or you need citations. Use CAG when you have a fixed set of knowledge that fits within the context window, latency is critical, and you want simpler deployment. For complex scenarios (like clinical decision support), a hybrid approach works: RAG to retrieve the relevant subset, then CAG to create temporary working memory for follow-up questions.
Multimodal RAG — Three Approaches
Real-world data isn't just text. It includes network diagrams, screenshots, scanned PDFs, videos, and audio. Multimodal RAG extends retrieval to handle multiple data modalities.
Approach 1: Text-ify Everything RAG
Convert all modalities to text first. Images → captions via captioning model. Audio/video → transcripts via STT. Then use standard text RAG. Easy but loses visual context and spatial relationships.
Approach 2: Hybrid Multimodal RAG
Retrieval is still text-based (search over captions + transcripts), but the LLM receives the original non-text data (images, audio clips) alongside retrieved text. The multimodal LLM reasons over everything together. Retrieval is only as good as the text captions.
Approach 3: Full Multimodal RAG
Uses a shared vector space — text, images, and audio all get embedded into the same space. A single query vector can directly retrieve text paragraphs, diagrams, and video frames. Most powerful but highest cost and complexity.
Native Multimodality vs Feature-Level Fusion
Feature-level fusion: A separate vision encoder extracts features from images and passes numerical representations to the LLM. The LLM only sees a summarized description, not the raw signal. Cheaper but information can be lost.
Native multimodality: All modalities (text, images, audio, video) are tokenized and embedded into a shared vector space. The model attends to everything simultaneously. For video, this uses spatial-temporal patches — 3D cubes capturing motion across frames, not just flat squares. Native models can also do any-to-any generation: take in any combination of modalities and output any combination.
RAG pitfall: "Garbage in, garbage out" applies fully. Data quality work is 60% of a RAG project. If your documents are messy or out of date, RAG will confidently retrieve bad information.
Key Frameworks & Tools
LangChain
Python/JS framework for LLM apps. Chains prompts, tools, memory, retrievers. Widely used but can be over-abstracted.
LlamaIndex
Focused specifically on RAG. Better for document indexing and retrieval pipelines.
CrewAI / AutoGen
Multi-agent frameworks. Define agents with different roles that collaborate on complex tasks.
Vercel AI SDK
Lightweight SDK for building AI chat UIs in Next.js/React. Handles streaming, tool calls.
Practical: Building an AI Feature (Simplified)
Building "An AI that answers customer questions using your knowledge base":
| Step | What You Do | Tools/Services |
|---|---|---|
| 1. Data prep | Export KB articles, clean HTML, split into chunks | Python, Unstructured.io |
| 2. Embeddings | Generate vector embeddings for each chunk | OpenAI Embeddings, Cohere, Voyage AI |
| 3. Vector store | Store embeddings with metadata | Pinecone, pgvector, Qdrant |
| 4. Retrieval | Build search: query → vector → top-k similar chunks | Vector DB SDK + Cohere Rerank |
| 5. Prompt | System prompt with role + injected context chunks | Prompt templating |
| 6. LLM call | Send assembled prompt to Claude/GPT-4 API | Anthropic API, OpenAI API |
| 7. UI | Chat interface with streaming | React, Vercel AI SDK |
| 8. Guardrails | Input validation, output filtering | Custom rules, Guardrails AI |
| 9. Observability | Log requests, responses, latency, cost | LangSmith, Langfuse, Datadog |
| 10. Evaluation | Measure answer quality, hallucination rate | Human review, RAGAS framework |
Multimodal AI — How Models See, Hear & Understand
What is Multimodal AI?
A modality is a type of data — text, images, audio, video, LIDAR, thermal imaging. A multimodal AI model can ingest and/or generate multiple data modalities. Instead of just tokenizing text strings, it can process a screenshot alongside a text description, or generate a video from a text prompt.
Native Multimodality: The Shared Vector Space
In a natively multimodal model, all modalities are tokenized and embedded into the same high-dimensional space. Text words become vectors. Image patches become vectors. Audio chunks become vectors. Because they all live in the same space, a picture of a cat ends up near the word "cat" — and the model can reason about them together without translating between different systems.
Video & Temporal Reasoning
Native video models use spatial-temporal patches — 3D cubes that capture an area across a short window of time (e.g., 8 video frames). Motion is baked into the token itself, not guessed by comparing separate images. This enables the model to understand actions like "picking up" vs "putting down."
Any-to-Any Generation
Because all modalities share the same vector space, multimodal models can do any-to-any generation: take in any combination of modalities and output any combination. You could ask the model to explain how to tie a tie — it generates text instructions and a short video clip, all coherent because everything lives in the same shared space.
Feature-Level Fusion vs Native Multimodality
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Feature-Level Fusion | Separate vision encoder extracts features; passes numerical array to LLM | Cheaper, easier to swap parts | Information lost in transfer; LLM sees summary, not raw data |
| Native Multimodality | All modalities tokenized into shared vector space; model attends to everything simultaneously | Richer understanding; model knows where to look based on the question | More compute, more complexity |
Why this matters: With feature-level fusion, the vision encoder processes your image before it knows what question you're asking — it might compress away the exact detail you need. With a shared vector space, the model attends to text and images simultaneously, so it knows where to look. Ask about a tiny icon in the corner of a screenshot, and the model can focus attention there.
Voice AI — Calls, IVR, and Conversational Agents
How Voice AI Works End-to-End
ASR/STT: Converts spoken audio to text. Leading: OpenAI Whisper (open source), Google Cloud Speech, Deepgram (low latency), AssemblyAI.
LLM Processing: Same as any text-based AI. Transcribed text is the input.
TTS: Converts LLM's text response to natural speech. Leading: ElevenLabs (most natural), OpenAI TTS, Google Cloud TTS, Play.ht, Cartesia (ultra-low latency).
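Stitched together, one conversational turn looks like this hedged sketch; transcribe(), chat(), and synthesize() are placeholders for your chosen ASR, LLM, and TTS providers.

```python
# Hedged sketch of a single voice-call turn with placeholder providers.
def transcribe(audio: bytes) -> str: ...     # ASR/STT, e.g., Whisper
def chat(text: str) -> str: ...              # LLM call on the transcript
def synthesize(text: str) -> bytes: ...      # TTS, e.g., ElevenLabs

def handle_turn(audio_in: bytes) -> bytes:
    text = transcribe(audio_in)              # speech -> text
    reply = chat(text)                       # decide what to say
    return synthesize(reply)                 # text -> speech
```

In production each stage streams (partial transcripts in, audio out before the LLM finishes), which is how the sub-500ms latency targets discussed below are met.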
Voice AI Use Cases
Welcome / Outbound Calls
AI calls new customers to welcome them, walk through onboarding. Platforms: Bland.ai, Retell AI.
Customer Support IVR
Replace "Press 1 for billing" with natural conversation. AI understands intent, resolves or routes.
Appointment Scheduling
AI calls to confirm/reschedule appointments. Used in healthcare, salons, auto services.
Sales Qualification
AI calls inbound leads, asks qualifying questions, logs to CRM. Example: Air AI, Vapi.
Key Decisions for Voice AI
| Decision | Options | Trade-off |
|---|---|---|
| Latency target | <500ms feels natural, >1s feels robotic | Lower latency = more expensive, requires streaming ASR+TTS |
| Build vs buy | Platforms: Vapi, Retell, Bland.ai. Build: Twilio + ASR + LLM + TTS | Platforms faster to ship. Custom gives full control. |
| Interruption handling | Must detect user speaking mid-response and stop gracefully | Hard to get right. Requires VAD (Voice Activity Detection). |
| Phone integration | Twilio, Vonage, Plivo for SIP/PSTN | Twilio is most mature. Costs per minute apply. |
Enterprise AI — Practical Considerations & Technical Debt
Data Privacy & Security
Zero Data Retention (ZDR): Enterprise API tiers guarantee your data is not used for training and is not retained. Verify this in your contract.
Data residency: Run Claude via AWS Bedrock in your preferred region. Data never leaves your VPC. Same with Google Vertex AI.
PII handling: Redact PII before sending, or use enterprise tiers with DPAs. Check SOC2 Type II, HIPAA BAA compliance.
Cost Management
LLM API Pricing Model
You pay per token — both input (prompt) and output (response). Roughly $3-15 per million input tokens depending on the model, i.e., about $0.01-0.05 for a typical ~3,500-token request.
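A quick worked example of what per-token pricing means in practice; the prices and traffic numbers are assumptions for illustration, not any provider's actual rates.

```python
# Worked cost example under assumed prices ($3 / $15 per million
# input / output tokens -- illustrative, check your provider's sheet).
input_price = 3 / 1_000_000      # $ per input token
output_price = 15 / 1_000_000    # $ per output token

requests_per_day = 10_000
input_tokens = 2_000             # system prompt + context per request
output_tokens = 500              # typical response length

daily = requests_per_day * (input_tokens * input_price
                            + output_tokens * output_price)
print(f"${daily:,.2f}/day")      # -> $135.00/day at these assumptions
```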
Cost Optimization Strategies
Model routing: Use a cheap/fast model (Claude Haiku, GPT-4o mini) for simple queries. Route complex queries to the expensive model. Can cut costs 60-80%.
Caching: Cache frequent queries. Anthropic offers prompt caching — reused system prompts cost a fraction.
Shorter prompts: Every token in your system prompt is charged on every request. Optimize for brevity.
Batch processing: For non-real-time tasks, use batch APIs at 50% discount.
AI Technical Debt — The Elephant in the Room
AI technical debt is trading off speed now for costs later — future cost from present shortcuts. It's the interest you pay because you didn't make a large enough down payment upfront. In AI, debt compounds even faster because AI is probabilistic, context-dependent, and moves extremely fast.
Strategic vs Reckless Technical Debt
Strategic: Taken consciously. You know the risks, they're documented, time-bound, with a remediation plan. A valid way to get to market fast.
Reckless: Poor discipline. No documentation, no remediation plan, no future — just a mess headed your way.
Four Categories of AI Technical Debt
1. Data Debt
Garbage in = amplified garbage out. Unvetted sources, bias, data drift, poisoning, no anonymization. Fix: Vet sources, check for bias, monitor drift, anonymize PII.
2. Model Debt
No version control, no evaluation metrics, no rollback ability, no penetration testing. Fix: Version models, set eval metrics, plan rollbacks, pen-test.
3. Prompt Debt
Undocumented system prompts, no input validation, prompt injection vulnerabilities, data leakage via outputs, no guardrails. Fix: Document prompts, validate inputs, use an AI gateway with input/output filtering.
4. Organizational Debt
Unclear ownership, no governance policy, no red teaming, scalability issues, latency surprises at scale. Fix: Define ownership, establish governance, red-team, plan for scale.
The result of unchecked AI technical debt: An AI you don't trust. "Ready, fire, aim" doesn't work. The project lifecycle hasn't changed just because it's AI: Requirements → Architecture → Implementation → Testing → Deployment → Evaluation → feed back to Requirements. AI technical debt = speed minus discipline, with massive compounding interest.
Team Structure for AI
| Role | What They Do | Typical Background |
|---|---|---|
| AI/ML Engineer | Builds pipelines, integrates LLMs, manages RAG, fine-tuning | Software engineer + ML experience |
| Prompt Engineer | Designs and optimizes system prompts, evaluates output quality | Domain expert + writing skill |
| Data Engineer | Prepares, cleans, and pipelines data for RAG / training | Data engineering, ETL |
| Platform/Infra | Manages GPUs, vector DBs, model serving, observability | DevOps / SRE with ML infra |
Salesforce AI Ecosystem
What Salesforce "Agentforce" Actually Is
Agentforce = LLM (multi-model routing) + Salesforce data (CRM, Data Cloud, Knowledge Base) + Tools (Salesforce actions: create case, update opportunity, send email) + Guardrails (Trust Layer + business rules) + Deployment (Service Cloud, Sales Cloud, web, Slack).
What Questions to Consider
• "How does Agentforce handle multi-step tool failures and retries?"
• "What's the latency overhead of the Trust Layer on each LLM call?"
• "Can I bring my own model (BYOM) and still use the orchestration layer?"
• "How does Data Cloud grounding work — is it RAG under the hood? What embedding model?"
• "What's the pricing model — per agent, per conversation, per action?"
• "How do I evaluate agent quality? Is there built-in testing/evaluation tooling?"
• "What observability do I get — can I see every reasoning step, tool call, and retrieval?"
Risks, Limitations & What Can Go Wrong
Hallucination
LLMs generate plausible-sounding false information. Mitigation: RAG, citations, confidence scoring, human-in-the-loop.
Prompt Injection
Malicious inputs that override system instructions. Mitigation: input sanitization, separate system/user contexts, output validation.
Data Leakage
Model reveals training data or other users' data. Mitigation: data isolation, output filtering, PII redaction.
Bias
Models inherit biases from training data. Mitigation: bias testing, diverse data, human oversight.
Cost Overruns
Token costs spike unexpectedly. One rogue agent loop burns budget. Mitigation: per-request budgets, circuit breakers, cost monitoring.
Vendor Lock-in
Building on one provider's API creates dependency. Mitigation: abstract the model layer, support model switching.
CTO Strategy — Where to Start & LLM vs Agent Decision Framework
The Pragmatic Adoption Ladder
1. Internal productivity (low risk, high value): Deploy an LLM chatbot connected to your internal knowledge base. Quickest win with least risk.
2. Customer-facing copilot (medium risk): AI that helps customers with common questions, grounded in your help center. Always with a "talk to human" escape hatch.
3. Process automation agents (higher risk): Agents that take actions — update records, send emails, process refunds. Requires guardrails, approval workflows.
4. Autonomous agents (highest complexity): Multi-step agents handling end-to-end workflows with minimal human oversight. Only after robust evaluation, monitoring, and rollback.
LLM vs Agent — When Simple is Better
A common mistake is building an elaborate agent with multi-step planning and tool use when a single LLM prompt would do the job faster and cleaner. Sometimes simple is better.
Use a Single LLM When…
• Task is single-step (quick answer, one-off task)
• Low complexity — no need for planning or external tools
• Speed matters — want fast results without overhead
• Examples: writing an email, summarizing a document, translating text, generating ideas, simple code snippets
Use an Agent When…
• Task requires multi-step reasoning and planning
• Need to use tools — APIs, databases, external systems
• Autonomy is required — the system decides what steps to take and in what order
• Examples: automating workflows, researching competitors + compiling + emailing reports, debugging + testing + deploying code
The LLM vs Agent heuristic: If you can describe the task as a single question with a single answer, use an LLM. If the task requires a workflow — pull data, run analysis, create a chart, email it — use an agent. Next time you're building with AI, ask: do I really need an agent, or will a simple LLM do?
The #1 mistake CTOs make: Starting with a model choice instead of starting with the problem. Pick a specific, measurable business problem first. Then figure out which AI approach solves it. Often you don't need the most expensive model — or any model at all.
Quick Reference: The AI Glossary
| Term | Plain English |
|---|---|
| Token | A chunk of text (~¾ of a word). The unit LLMs process and bill by. |
| Embedding | Converting text to a list of numbers that capture meaning. Similar texts → similar numbers. |
| Vector Database | A database optimized for storing and searching embeddings by similarity. |
| RAG | Retrieval-Augmented Generation. Look up relevant docs, then feed them to the LLM. |
| CAG | Cache Augmented Generation. Preload entire knowledge base into context; use KV cache for fast queries. |
| Fine-tuning | Additional training on specific data to specialize a model. Expensive, usually unnecessary. |
| Prompt Engineering | Crafting the instructions (system prompt) to get the best output from an LLM. |
| Temperature | Controls randomness. 0 = deterministic, 1 = creative. Use low for facts, higher for brainstorming. |
| Context Window | Max text the model can process at once. Bigger = more info per request, but more expensive. |
| RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful vs. just technically correct. |
| Inference | Running a trained model to get a prediction/response. The thing you pay for in production. |
| Hallucination | When the model confidently outputs false information. |
| Guardrails | Rules that constrain AI behavior — what it can/can't do, say, or access. |
| MCP | Model Context Protocol. Open standard for connecting LLMs to external tools and data sources. |
| Function Calling | The LLM outputs structured data to invoke external tools/APIs. Mechanism behind agents. |
| Agentic | AI that can plan, act, observe, and loop — not just respond to a single prompt. |
| SLM | Small Language Model. <10B parameters. Fast, cheap, on-prem specialist. |
| Frontier Model | Most capable models today. Best reasoning, best at complex multi-step tasks. |
| KV Cache | Key-Value Cache. Model's internal state after digesting documents; used in CAG. |
| Progressive Disclosure | Loading skill metadata first, full instructions when relevant, resources at point of need. |
| ReAct | Reasoning + Acting. The think-act-observe loop pattern used by most AI agents. |
| Shared Vector Space | Single embedding space where text, images, and audio all coexist; enables native multimodality. |